Author: Dell'Oca Mattia
Role: Computer Science student
University: SUPSI, Lugano (Switzerland), SUPSI.ch
Date: 06/12/2022
Mail: mattia.delloca@student.supsi.ch
In this project I will analyze data related to dogs that are available for adoption in the USA.
The dataset I'm working on is taken from Kaggle.com.
The data were gathered from Petfinder, a popular social network that helps users either adopt a dog or find a new home for their four-pawed friend. There are about 58k entries in the dataset, each one representing a different dog.
In this analysis I will try to answer the following questions:
Let's start with the analysis!
First of all, we need to import our data
import pandas as pd
import plotly.express as px
df = pd.read_csv("data/allDogDescriptions.csv")
print("Total columns: ", len(df.columns))
print("Column names: ", df.columns)
print("Sample: ")
df.sample(5)
Total columns: 36
Column names: Index(['index', 'id', 'org_id', 'url', 'type.x', 'species', 'breed_primary',
'breed_secondary', 'breed_mixed', 'breed_unknown', 'color_primary',
'color_secondary', 'color_tertiary', 'age', 'sex', 'size', 'coat',
'fixed', 'house_trained', 'declawed', 'special_needs', 'shots_current',
'env_children', 'env_dogs', 'env_cats', 'name', 'status', 'posted',
'contact_city', 'contact_state', 'contact_zip', 'contact_country',
'stateQ', 'accessed', 'type.y', 'description'],
dtype='object')
Sample:
| | index | id | org_id | url | type.x | species | breed_primary | breed_secondary | breed_mixed | breed_unknown | ... | status | posted | contact_city | contact_state | contact_zip | contact_country | stateQ | accessed | type.y | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29481 | 29481 | 45631095 | IL181 | https://www.petfinder.com/dog/titan-45631095/i... | Dog | Dog | Black Labrador Retriever | NaN | True | False | ... | adoptable | 2019-08-15T22:34:43+0000 | Springfield | IL | 62703 | US | IL | 2019-09-20 | Dog | This pup is probably around 6 months old |
| 49165 | 49165 | 42411935 | OK406 | https://www.petfinder.com/dog/eli-42411935/ok/... | Dog | Dog | Chihuahua | Boston Terrier | True | False | ... | adoptable | 2018-08-08T00:50:57+0000 | Mustang | OK | 73064 | US | OK | 2019-09-20 | Dog | This little guy may be a senior, but he has a ... |
| 10764 | 10764 | 45844393 | CT560 | https://www.petfinder.com/dog/caress-45844393/... | Dog | Dog | Husky | NaN | True | False | ... | adoptable | 2019-09-03T01:12:20+0000 | Pomfret Center | CT | 06259 | US | CT | 2019-09-20 | Dog | BREED EXPERIENCE NECESSARY FOR ADOPTION |
| 29485 | 29485 | 45626096 | IL11 | https://www.petfinder.com/dog/harlow-45626096/... | Dog | Dog | Australian Cattle Dog / Blue Heeler | Mixed Breed | True | False | ... | adoptable | 2019-08-15T17:06:35+0000 | Pekin | IL | 61554 | US | IL | 2019-09-20 | Dog | This little fox i mean pup is Harlow. He is a ... |
| 51513 | 51513 | 46005071 | TN917 | https://www.petfinder.com/dog/grover-46005071/... | Dog | Dog | Pomeranian | NaN | False | False | ... | adoptable | 2019-09-17T00:47:06+0000 | White House | TN | 37188 | US | TN | 2019-09-20 | Dog | Grover is a happy-go-lucky, sweet, friendly li... |
5 rows × 36 columns
Among all the columns, I will mostly focus on the following:
# Remove entries with incorrect contact state
df = df.loc[df['contact_state'].str.len() == 2]
df = df.loc[df['sex'] != 'Unknown']
#1.1 Sex distribution
fig = px.pie(df, names = 'sex', color='sex',
title = 'Sex distribution',
category_orders={'sex' : ["Male", "Female"]},
color_discrete_sequence=['rgb(0, 0, 255)', 'rgb(255, 100, 127)'],
height = 600)
fig.update_traces(textinfo='percent+label')
fig.update(layout_showlegend=False)
fig.show()
#1.2 Age distribution
fig = px.histogram(df, x='age', category_orders={'age' : ["Baby", "Young", "Adult", "Senior"]}, title = "Age distribution")
fig.show()
#1.3 Size distribution
fig = px.pie(df, names='size', color='size', title='Size distribution',
category_orders={'size' : ["Small", "Medium", "Large", "Extra Large"]},
color_discrete_sequence=px.colors.sequential.Purp,
height = 400)
fig.update_traces(textinfo='percent+label')
fig.update(layout_showlegend=False)
fig.show()
#1.4 Amount of mixed-breed and purebred dogs
fig = px.histogram(df, x = "breed_mixed", title="Amount of mixed-breed and purebred dogs")
fig.update_layout(xaxis={'title' : 'Is breed mixed'})
fig.show()
#1.5 Distribution of publications over time
df['posted'] = pd.to_datetime(df['posted'], errors='coerce')
dfd = pd.DataFrame(index=df.posted)
px.histogram(dfd, x=dfd.index, title="Distribution of publications over time")
This graph may look confusing at first sight, but its meaning is quite straightforward: it counts the listings posted in each two-week interval. Assuming the rate of new posts stays roughly constant over time, the number of still-open listings drops by almost half every two weeks. In other words, on average one dog in two finds a new home within about two or three weeks!
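This reasoning can be sketched as a quick back-of-the-envelope calculation. It assumes the count of open listings decays exponentially with a half-life of about two weeks, a value read off the histogram rather than fitted to the data:

```python
# Assumption (read off the histogram, not fitted to the data):
# open listings decay exponentially with a ~2-week half-life.
half_life = 2.0  # weeks

def fraction_remaining(weeks):
    """Share of listings still open after `weeks` weeks."""
    return 0.5 ** (weeks / half_life)

print(f"after 2 weeks: {fraction_remaining(2):.0%} still listed")
print(f"after 3 weeks: {fraction_remaining(3):.1%} still listed")
```

Under this assumption the median time on the platform equals the half-life itself, which matches the "two or three weeks" estimate above.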
Let's start by seeing how many different breeds are present in our dataset
#2.1 Breeds
tdf = df.loc[(df["breed_mixed"] == False) & (df['breed_primary'] != 'Mixed Breed')].copy()  # purebred dogs only; .copy() avoids SettingWithCopyWarning later
print('Number of different breeds: ', len(tdf["breed_primary"].unique()))
print('Breeds: ', tdf["breed_primary"].unique())
Number of different breeds: 184 Breeds: ['Shepherd' 'German Shepherd Dog' 'Dachshund' 'Chihuahua' 'Cocker Spaniel' 'Labrador Retriever' 'Terrier' 'Husky' 'Poodle' 'Jindo' 'Pit Bull Terrier' 'Siberian Husky' 'Giant Schnauzer' 'Miniature Pinscher' 'Wirehaired Terrier' 'Catahoula Leopard Dog' 'Doberman Pinscher' 'Rat Terrier' 'Irish Terrier' 'Yorkshire Terrier' 'Cairn Terrier' 'Pomeranian' 'Australian Shepherd' 'Retriever' 'English Bulldog' 'Coonhound' 'Shar-Pei' 'American Staffordshire Terrier' 'Beagle' 'Treeing Walker Coonhound' 'Xoloitzcuintli / Mexican Hairless' 'Saint Bernard' 'Dogo Argentino' 'Boxer' 'American Bulldog' 'French Bulldog' 'English Springer Spaniel' 'Standard Poodle' 'Akita' 'Mastiff' 'Italian Greyhound' 'Belgian Shepherd / Malinois' 'Chow Chow' 'Rottweiler' 'Basset Hound' 'Border Collie' 'Staffordshire Bull Terrier' 'Shiba Inu' 'Corgi' 'Cattle Dog' 'Weimaraner' 'Australian Cattle Dog / Blue Heeler' 'Black Labrador Retriever' 'Pointer' 'Great Pyrenees' 'Hound' 'Carolina Dog' 'Anatolian Shepherd' 'Alaskan Malamute' 'Shetland Sheepdog / Sheltie' 'Bull Terrier' 'Great Dane' 'Papillon' 'Black Mouth Cur' 'Akbash' 'Plott Hound' 'Jack Russell Terrier' 'Basenji' 'Maltese' 'Lhasa Apso' 'German Wirehaired Pointer' 'Shih Tzu' 'Boston Terrier' 'Golden Retriever' 'Bluetick Coonhound' 'Feist' 'Yellow Labrador Retriever' 'Miniature Schnauzer' 'Black and Tan Coonhound' 'Schnauzer' 'Redbone Coonhound' 'English Coonhound' 'Flat-Coated Retriever' 'Foxhound' 'Mountain Cur' 'Pekingese' 'Chocolate Labrador Retriever' 'Collie' 'Japanese Chin' 'Canaan Dog' 'Brussels Griffon' 'Bichon Frise' 'Dogue de Bordeaux' 'Bullmastiff' 'Cane Corso' 'Australian Kelpie' 'Cavalier King Charles Spaniel' 'Pug' 'Clumber Spaniel' 'German Shorthaired Pointer' 'Airedale Terrier' 'Parson Russell Terrier' 'Fox Terrier' 'Chinese Crested Dog' 'Norwich Terrier' 'Toy Fox Terrier' 'Spaniel' 'Dalmatian' 'Norwegian Elkhound' 'Tibetan Mastiff' 'English Pointer' 'Newfoundland Dog' 'Presa Canario' 
'Rhodesian Ridgeback' 'Dutch Shepherd' 'Irish Wolfhound' 'Bernese Mountain Dog' 'White German Shepherd' 'Silky Terrier' 'Standard Schnauzer' 'Maremma Sheepdog' 'Neapolitan Mastiff' 'Scottish Terrier' 'Greyhound' 'Cockapoo' 'Blue Lacy' 'English Setter' 'Affenpinscher' 'Havanese' 'Old English Sheepdog' 'American Foxhound' 'Pembroke Welsh Corgi' 'Galgo Spanish Greyhound' 'Miniature Poodle' 'American Hairless Terrier' 'Harrier' 'Chesapeake Bay Retriever' 'Whippet' 'Caucasian Sheepdog / Caucasian Ovtcharka' 'Portuguese Podengo' 'Eskimo Dog' 'Cardigan Welsh Corgi' 'Border Terrier' 'Saluki' 'Boerboel' 'Brittany Spaniel' 'Patterdale Terrier / Fell Terrier' 'Belgian Shepherd / Sheepdog' 'Lancashire Heeler' 'Bloodhound' 'Ibizan Hound' 'Tibetan Spaniel' 'Greater Swiss Mountain Dog' 'Keeshond' 'Smooth Collie' 'Norfolk Terrier' 'Chinook' 'American Eskimo Dog' 'Wheaten Terrier' 'Curly-Coated Retriever' 'Nova Scotia Duck Tolling Retriever' 'Vizsla' 'Irish Setter' 'Wire Fox Terrier' 'Samoyed' 'Setter' 'Petit Basset Griffon Vendeen' 'Manchester Terrier' 'Rough Collie' 'Bearded Collie' 'English Shepherd' 'Schipperke' 'Bolognese' 'West Highland White Terrier / Westie' 'McNab' 'Chinese Foo Dog' 'Wirehaired Dachshund' 'Tibetan Terrier' 'German Pinscher' 'Kyi Leo' 'Coton de Tulear' 'Kuvasz' 'Spitz' 'Tosa Inu']
We can see that many breed labels are redundant: let's take 'Shepherd' as an example
#2.2 'Shepherd' breeds
mask = (tdf['breed_mixed'] == False) & (tdf['breed_primary'].str.contains('Shepherd'))
print(tdf.loc[mask, 'breed_primary'].value_counts())
German Shepherd Dog            870
Shepherd                       211
Australian Shepherd            167
Belgian Shepherd / Malinois     43
Anatolian Shepherd              11
Dutch Shepherd                  10
English Shepherd                 5
White German Shepherd            4
Belgian Shepherd / Sheepdog      2
Name: breed_primary, dtype: int64
Most of these sub-breeds have very few specimens, so it is more informative to group them under the generic 'Shepherd' label
#2.3 Generalizing 'Shepherd'
mask2 = mask & (tdf['breed_primary'] != 'German Shepherd Dog') & (tdf['breed_primary'] != 'Australian Shepherd')
tdf.loc[mask2, 'breed_primary'] = 'Shepherd'
print(tdf.loc[mask, 'breed_primary'].value_counts())
German Shepherd Dog    870
Shepherd               286
Australian Shepherd    167
Name: breed_primary, dtype: int64
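The same collapsing could be applied to other redundant families (Terrier, Coonhound, and so on). A hedged sketch, where the family list, the keep list, and the helper name `generalize_breed` are my own illustrative choices rather than anything taken from the dataset:

```python
# Hypothetical helper that collapses rare sub-breeds into their family name.
# The `families` and `keep` defaults are illustrative assumptions.
def generalize_breed(breed,
                     families=('Shepherd', 'Coonhound', 'Terrier'),
                     keep=('German Shepherd Dog', 'Australian Shepherd',
                           'Pit Bull Terrier')):
    """Map e.g. 'Dutch Shepherd' -> 'Shepherd', leaving listed breeds intact."""
    for family in families:
        if family in breed and breed not in keep:
            return family
    return breed
```

It could then be applied column-wide with `tdf['breed_primary'] = tdf['breed_primary'].apply(generalize_breed)`.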
#2.4 Most common breeds
mostCommon = tdf['breed_primary'].value_counts().head(20)
tdf['breed_category'] = "Other"
mask = tdf['breed_primary'].isin(mostCommon.index)
tdf.loc[mask, 'breed_category'] = tdf.loc[mask,'breed_primary']
fig = px.histogram(tdf, y="breed_category", height=800, title='Histogram of most common breeds', labels={"breed_category":"Breed"},
category_orders={'breed_category' : ["Other"]},
color=tdf['breed_category']=='Other')
fig.update_yaxes(categoryorder='total ascending')
fig.update_layout(showlegend=False)
fig.show()
#3.1 Distribution of names given the dog size
# normalize capitalization so e.g. 'BELLA' and 'bella' count as the same name
df['name'] = df['name'].str.lower().str.capitalize()
mostCommon = df['name'].value_counts().head(5)
mask = df['name'].isin(mostCommon.index)
ndf = df.loc[mask]
fig = px.histogram(ndf, x='name', color='size', title = 'Number of dogs with same name given their size',
category_orders={'name' : ['Bella', 'Max', 'Buddy', 'Daisy', 'Duke'], 'size' : ['Small', 'Medium', 'Large', 'Extra Large']},
color_discrete_sequence=px.colors.sequential.Purp)
fig.update_layout(height = 800)
fig.show()
#3.2 Distribution of house trained dogs given their size
import plotly.graph_objects as go
top_labels = ['House trained', 'Not house trained']
colors = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)']
x_data_t = df.loc[df['house_trained'] == True].groupby('size')['size'].count()
x_data_f = df.loc[df['house_trained'] == False].groupby('size')['size'].count()
size_order = ['Extra Large', 'Large', 'Medium', 'Small']
x_data_t = x_data_t.reindex(size_order, fill_value=0)
x_data_f = x_data_f.reindex(size_order, fill_value=0)
x_data = []
for x in range(len(x_data_t)):
    row = []
    # positional access with .iloc: the Series index holds size labels
    tot = x_data_t.iloc[x] + x_data_f.iloc[x]
    row.append(x_data_t.iloc[x] / tot * 100)
    row.append(x_data_f.iloc[x] / tot * 100)
    x_data.append(row)
y_data = ['Extra Large', 'Large', 'Medium', 'Small']
fig = go.Figure()
for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        fig.add_trace(go.Bar(
            x=[xd[i]], y=[yd],
            orientation='h',
            marker=dict(
                color=colors[i],
                line=dict(color='rgb(248, 248, 249)', width=1)
            )
        ))
fig.update_layout(
xaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
domain=[0.15, 1]
),
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
),
barmode='stack',
paper_bgcolor='rgb(248, 248, 255)',
plot_bgcolor='rgb(248, 248, 255)',
margin=dict(l=120, r=10, t=140, b=80),
showlegend=False,
)
annotations = []
for yd, xd in zip(y_data, x_data):
    # label the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            font=dict(family='Arial', size=14,
                                      color='rgb(67, 67, 67)'),
                            showarrow=False, align='right'))
    # label the first column header (on the top)
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd[0] / 2, y=1.1,
                                text=top_labels[0],
                                font=dict(family='Arial', size=14,
                                          color='rgb(67, 67, 67)'),
                                showarrow=False))
    space = xd[0]
    for i in range(1, len(xd)):
        # label the remaining column headers
        if yd == y_data[-1]:
            annotations.append(dict(xref='x', yref='paper',
                                    x=space + (xd[i] / 2), y=1.1,
                                    text=top_labels[i],
                                    font=dict(family='Arial', size=14,
                                              color='rgb(67, 67, 67)'),
                                    showarrow=False))
        space += xd[i]
fig.update_layout(title='Percentage of house trained dogs given their size', annotations=annotations)
fig.update_layout(height = 500)
fig.update_traces(textposition='inside', texttemplate = "%{x:.1f}%", textangle=0)
fig.show()
#3.3 Distribution of coat length given the state (cold/hot)
# cold states: ND, MN, ME
# hot states: FL, LA, AZ
states = pd.Series(0, index=['ND', 'MN', 'ME', 'FL', 'LA', 'AZ'])
top_labels = ['Short', 'Medium', 'Long']
colorsCold = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)', 'rgba(122, 120, 168, 0.8)']
colorsHot = ['rgba(165, 42, 42, 0.8)', 'rgba(204, 85, 0, 0.8)', 'rgba(233, 116, 81, 0.8)']
mask = df['contact_state'].isin(states.index)
x_data_s = df.loc[(df['coat'] == 'Short') & mask].groupby('contact_state')['contact_state'].count()
x_data_m = df.loc[(df['coat'] == 'Medium') & mask].groupby('contact_state')['contact_state'].count()
x_data_l = df.loc[(df['coat'] == 'Long') & mask].groupby('contact_state')['contact_state'].count()
x_data_s = x_data_s.reindex(states.index, fill_value=0)
x_data_m = x_data_m.reindex(states.index, fill_value=0)
x_data_l = x_data_l.reindex(states.index, fill_value=0)
x_data = []
for x in range(len(states)):
    row = []
    # positional access with .iloc: the Series index holds state codes
    tot = x_data_s.iloc[x] + x_data_m.iloc[x] + x_data_l.iloc[x]
    row.append(x_data_s.iloc[x] / tot * 100)
    row.append(x_data_m.iloc[x] / tot * 100)
    row.append(x_data_l.iloc[x] / tot * 100)
    x_data.append(row)
y_data = ['North Dakota', 'Minnesota', 'Maine', 'Florida', 'Louisiana', 'Arizona']
fig = go.Figure()
for i in range(0, len(x_data[0])):
    count = 0
    for xd, yd in zip(x_data, y_data):
        # use warm colors for the hot states (the last three entries of y_data)
        if count > 2:
            fig.add_trace(go.Bar(
                x=[xd[i]], y=[yd],
                orientation='h',
                marker=dict(
                    color=colorsHot[i],
                    line=dict(color='rgb(248, 248, 249)', width=1)
                )
            ))
        else:
            fig.add_trace(go.Bar(
                x=[xd[i]], y=[yd],
                orientation='h',
                marker=dict(
                    color=colorsCold[i],
                    line=dict(color='rgb(248, 248, 249)', width=1)
                )
            ))
        count += 1
fig.update_layout(
xaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
domain=[0.15, 1]
),
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
),
barmode='stack',
paper_bgcolor='rgb(248, 248, 255)',
plot_bgcolor='rgb(248, 248, 255)',
margin=dict(l=120, r=10, t=140, b=80),
showlegend=False,
)
annotations = []
for yd, xd in zip(y_data, x_data):
    # label the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            font=dict(family='Arial', size=14,
                                      color='rgb(67, 67, 67)'),
                            showarrow=False, align='right'))
    # label the first column header (on the top)
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd[0] / 2, y=1.1,
                                text=top_labels[0],
                                font=dict(family='Arial', size=14,
                                          color='rgb(67, 67, 67)'),
                                showarrow=False))
    space = xd[0]
    for i in range(1, len(xd)):
        # label the remaining column headers
        if yd == y_data[-1]:
            annotations.append(dict(xref='x', yref='paper',
                                    x=space + (xd[i] / 2), y=1.1,
                                    text=top_labels[i],
                                    font=dict(family='Arial', size=14,
                                              color='rgb(67, 67, 67)'),
                                    showarrow=False))
        space += xd[i]
fig.update_layout(title='Coat length given the state', annotations=annotations)
fig.update_layout(height = 600)
fig.update_traces(textposition='inside', texttemplate = "%{x:.1f}%", textangle=0)
fig.show()
#3.4 Special needs given the age
top_labels = ['Has special needs', 'Doesn\'t have special needs']
colors = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)']
x_data_t = df.loc[df['special_needs'] == True].groupby('age')['age'].count()
x_data_f = df.loc[df['special_needs'] == False].groupby('age')['age'].count()
age_order = ['Senior', 'Adult', 'Young', 'Baby']
x_data_t = x_data_t.reindex(age_order, fill_value=0)
x_data_f = x_data_f.reindex(age_order, fill_value=0)
x_data = []
for x in range(len(x_data_t)):
    row = []
    # positional access with .iloc: the Series index holds age labels
    tot = x_data_t.iloc[x] + x_data_f.iloc[x]
    row.append(x_data_t.iloc[x] / tot * 100)
    row.append(x_data_f.iloc[x] / tot * 100)
    x_data.append(row)
y_data = ['Senior', 'Adult', 'Young', 'Baby']
fig = go.Figure()
for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        fig.add_trace(go.Bar(
            x=[xd[i]], y=[yd],
            orientation='h',
            marker=dict(
                color=colors[i],
                line=dict(color='rgb(248, 248, 249)', width=1)
            )
        ))
fig.update_layout(
xaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
domain=[0.15, 1]
),
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
),
barmode='stack',
paper_bgcolor='rgb(248, 248, 255)',
plot_bgcolor='rgb(248, 248, 255)',
margin=dict(l=120, r=10, t=140, b=80),
showlegend=False,
)
annotations = []
for yd, xd in zip(y_data, x_data):
    # label the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            font=dict(family='Arial', size=14,
                                      color='rgb(67, 67, 67)'),
                            showarrow=False, align='right'))
    # label the first column header (on the top)
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd[0] / 2, y=1.1,
                                text=top_labels[0],
                                font=dict(family='Arial', size=14,
                                          color='rgb(67, 67, 67)'),
                                showarrow=False))
    space = xd[0]
    for i in range(1, len(xd)):
        # label the remaining column headers
        if yd == y_data[-1]:
            annotations.append(dict(xref='x', yref='paper',
                                    x=space + (xd[i] / 2), y=1.1,
                                    text=top_labels[i],
                                    font=dict(family='Arial', size=14,
                                              color='rgb(67, 67, 67)'),
                                    showarrow=False))
        space += xd[i]
fig.update_layout(title='Special needs given the age', annotations=annotations)
fig.update_layout(height = 600, uniformtext_minsize=12, uniformtext_mode='hide')
fig.update_traces(textposition='inside', texttemplate = "%{x:.1f}%", textangle=0)
fig.show()
#4.1 Map of dogs' origin
import plotly.offline as po
states = df.groupby('contact_state')['contact_state'].count()
data = dict(type = 'choropleth',
locations = states.index,
locationmode = 'USA-states',
z = states,
colorscale = 'Reds',
colorbar = {'title' : 'Number of dogs'})
layout = dict(title = 'USA map of dogs\' origin',
geo = dict(scope='usa' ,
showlakes = False,
lakecolor = 'rgb(0,191,255)'),
height = 600)
x = go.Figure(data = [data],
layout = layout)
po.iplot(x)
#4.2 Map of dogs' percentage of vaccinations
vaccinations = df.loc[df['shots_current'] == True].groupby('contact_state')['contact_state'].count()
vaccinations = vaccinations.reindex(states.index, fill_value=0)
# percentage of vaccinated dogs per state (element-wise, aligned on the state index)
vaccinations = vaccinations / states * 100
data = dict(type = 'choropleth',
locations = vaccinations.index,
locationmode = 'USA-states',
z = vaccinations,
colorscale = 'Blues',
colorbar = {'title' : '% vaccinated'})
layout = dict(title = 'USA map of dogs\' vaccinations in percentage (%)',
geo = dict(scope='usa' ,
showlakes = False,
lakecolor = 'rgb(0,191,255)'),
height = 600)
x = go.Figure(data = [data],
layout = layout)
po.iplot(x)
#4.3 Map of purebred dogs in percentage
thoroughbred = df.loc[df['breed_mixed'] == False].groupby('contact_state')['contact_state'].count()
thoroughbred = thoroughbred.reindex(states.index, fill_value=0)
# percentage of purebred dogs per state (element-wise, aligned on the state index)
thoroughbred = thoroughbred / states * 100
data = dict(type = 'choropleth',
locations = thoroughbred.index,
locationmode = 'USA-states',
z = thoroughbred,
colorscale = 'Oranges',
colorbar = {'title' : '% purebred'})
layout = dict(title = 'Map of purebred dogs in percentage (%)',
geo = dict(scope='usa' ,
showlakes = False,
lakecolor = 'rgb(0,191,255)'),
height = 600)
x = go.Figure(data = [data],
layout = layout)
po.iplot(x)
#4.4 Map of most popular breed per state
mask = df['breed_mixed'] == False
tmp = df.loc[mask].groupby('contact_state')['breed_primary'].describe()
mostCommon = pd.Series(data=tmp.top, index=tmp.index)
mostCommon
fig = px.choropleth(df, locations = mostCommon.index,
locationmode = 'USA-states',
color = mostCommon,
height = 600,
title = 'Map of most popular dog breed for each state')
fig.update_layout( geo = dict(scope='usa' ,
showlakes = False))
fig.show()
#5.1 Most common words in all description
# importing all necessary modules
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd
comment_words = ''
stopwords = set(STOPWORDS)
# iterate through the csv file
for val in df['description']:
    # typecast each value to string (descriptions may be NaN)
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(comment_words)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Most common words in dogs' descriptions")
plt.tight_layout(pad = 0)
plt.show()
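The corpus-building loop is repeated verbatim for every subset analyzed below. As a sketch, it could be factored into a small helper (the function name `build_corpus` is my own, not from the notebook):

```python
def build_corpus(descriptions):
    """Lowercase an iterable of descriptions and join them into one string."""
    words = []
    for val in descriptions:
        # str() handles NaN/None values, as the loop above does
        words.extend(str(val).lower().split())
    return " ".join(words)
```

For example, `comment_words = build_corpus(df.loc[mask, 'description'])` would replace each of the loops that follow.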
#5.2 Most common words for Pit Bull Terrier
comment_words = ''
stopwords = set(STOPWORDS)
mask = (df['breed_mixed'] == False) & (df['breed_primary'] == 'Pit Bull Terrier')
# iterate through the csv file
for val in df.loc[mask, 'description']:
    # typecast each value to string (descriptions may be NaN)
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(comment_words)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Most common words for Pit Bull Terrier")
plt.tight_layout(pad = 0)
plt.show()
#5.3 Most common words for Chihuahua
comment_words = ''
stopwords = set(STOPWORDS)
mask = (df['breed_mixed'] == False) & (df['breed_primary'] == 'Chihuahua')
# iterate through the csv file
for val in df.loc[mask, 'description']:
    # typecast each value to string (descriptions may be NaN)
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(comment_words)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Most common words for Chihuahua")
plt.tight_layout(pad = 0)
plt.show()
#5.4 Most common words for Senior Dogs
comment_words = ''
stopwords = set(STOPWORDS)
mask = (df['age'] == 'Senior')
# iterate through the csv file
for val in df.loc[mask, 'description']:
    # typecast each value to string (descriptions may be NaN)
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(comment_words)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Most common words for Senior Dogs")
plt.tight_layout(pad = 0)
plt.show()
#5.5 Most common words for Baby Dogs
comment_words = ''
stopwords = set(STOPWORDS)
mask = (df['age'] == 'Baby')
# iterate through the csv file
for val in df.loc[mask, 'description']:
    # typecast each value to string (descriptions may be NaN)
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(comment_words)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Most common words for Baby Dogs")
plt.tight_layout(pad = 0)
plt.show()